Goto

Collaborating Authors

 company name


DRBench: A Realistic Benchmark for Enterprise Deep Research

Abaskohi, Amirhossein, Chen, Tianyi, Muñoz-Mármol, Miguel, Fox, Curtis, Ramesh, Amrutha Varshini, Marcotte, Étienne, Lù, Xing Han, Chapados, Nicolas, Gella, Spandana, Pal, Christopher, Drouin, Alexandre, Laradji, Issam H.

arXiv.org Artificial Intelligence

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, ``What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.


JEL: A Novel Model Linking Knowledge Graph entities to News Mentions

Kishelev, Michael, Bhadani, Pranab, Ding, Wanying, Chaudhri, Vinay

arXiv.org Artificial Intelligence

We present JEL, a novel computationally efficient end-to-end multi-neural network based entity linking model, which beats current state-of-art model. Knowledge Graphs have emerged as a compelling abstraction for capturing critical relationships among the entities of interest and integrating data from multiple heterogeneous sources. A core problem in leveraging a knowledge graph is linking its entities to the mentions (e.g., people, company names) that are encountered in textual sources (e.g., news, blogs., etc) correctly, since there are thousands of entities to consider for each mention. This task of linking mentions and entities is referred as Entity Linking (EL). It is a fundamental task in natural language processing and is beneficial in various uses cases, such as building a New Analytics platform. News Analytics, in JPMorgan, is an essential task that benefits multiple groups across the firm. According to a survey conducted by the Innovation Digital team 1 , around 25 teams across the firm are actively looking for news analytics solutions, and more than \$2 million is being spent annually on external vendor costs. Entity linking is critical for bridging unstructured news text with knowledge graphs, enabling users access to vast amounts of curated data in a knowledge graph and dramatically facilitating their daily work.


Advanced Real-Time Fraud Detection Using RAG-Based LLMs

Singh, Gurjot, Singh, Prabhjot, Singh, Maninder

arXiv.org Artificial Intelligence

Artificial Intelligence has become a double edged sword in modern society being both a boon and a bane. While it empowers individuals it also enables malicious actors to perpetrate scams such as fraudulent phone calls and user impersonations. This growing threat necessitates a robust system to protect individuals In this paper we introduce a novel real time fraud detection mechanism using Retrieval Augmented Generation technology to address this challenge on two fronts. First our system incorporates a continuously updating policy checking feature that transcribes phone calls in real time and uses RAG based models to verify that the caller is not soliciting private information thus ensuring transparency and the authenticity of the conversation. Second we implement a real time user impersonation check with a two step verification process to confirm the callers identity ensuring accountability. A key innovation of our system is the ability to update policies without retraining the entire model enhancing its adaptability. We validated our RAG based approach using synthetic call recordings achieving an accuracy of 97.98 percent and an F1score of 97.44 percent with 100 calls outperforming state of the art methods. This robust and flexible fraud detection system is well suited for real world deployment.


Spurious Forgetting in Continual Learning of Language Models

Zheng, Junhao, Cai, Xidi, Qiu, Shengjie, Ma, Qianli

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of "spurious forgetting", proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fix the bottom layers of the model, leading to substantial improvements in four continual learning scenarios. Our findings underscore the critical distinction between task alignment and knowledge retention, paving the way for more effective strategies in continual learning.


An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks

Yu, Yijiong

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts and have almost perfect performance in traditional retrieval tasks. However, their performance significantly degrades when it comes to numerical calculations in the long-context. Numeric-involved long-context tasks typically cannot be addressed by current LLMs in normal settings due to their inherent limitations in simultaneously handling complex and massive information. Some CoT like prompting methods can improve accuracy but demands massive output tokens, which is costly and slow. To address this issue, we propose a workflow, which decompose a numeric-involved long-context task into 4 low-level subtasks: judging, extracting and processing with code and conclusion. The former 2 subtasks is relatively simple, which allows us to use smaller models for efficiently processing long context. When numerical calculations are required, we use code generated by LLMs to avoid the disadvantage of LLM not being good at calculations. The results in 2 numeric-involved long-context benchmarks demonstrate our workflow can not only improve accuracy, but also significantly reduce the cost of API calls.


Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach

Dolphin, Rian, Dursun, Joe, Chow, Jonathan, Blankenship, Jarrett, Adams, Katie, Pike, Quinton

arXiv.org Artificial Intelligence

Financial news plays a crucial role in decision-making processes across the financial sector, yet the efficient processing of this information into a structured format remains challenging. This paper presents a novel approach to financial news processing that leverages Large Language Models (LLMs) to overcome limitations that previously prevented the extraction of structured data from unstructured financial news. We introduce a system that extracts relevant company tickers from raw news article content, performs sentiment analysis at the company level, and generates summaries, all without relying on pre-structured data feeds. Our methodology combines the generative capabilities of LLMs, and recent prompting techniques, with a robust validation framework that uses a tailored string similarity approach. Evaluation on a dataset of 5530 financial news articles demonstrates the effectiveness of our approach, with 90% of articles not missing any tickers compared with current data providers, and 22% of articles having additional relevant tickers. In addition to this paper, the methodology has been implemented at scale with the resulting processed data made available through a live API endpoint, which is updated in real-time with the latest news. To the best of our knowledge, we are the first data provider to offer granular, per-company sentiment analysis from news articles, enhancing the depth of information available to market participants. We also release the evaluation dataset of 5530 processed articles as a static file, which we hope will facilitate further research leveraging financial news.


Presenting Terrorizer: an algorithm for consolidating company names in patent assignees

Ascione, Grazia Sveva, Sterzi, Valerio

arXiv.org Artificial Intelligence

The problem of disambiguation of company names poses a significant challenge in extracting useful information from patents. This issue biases research outcomes as it mostly underestimates the number of patents attributed to companies, particularly multinational corporations which file patents under a plethora of names, including alternate spellings of the same entity and, eventually, companies' subsidiaries. To date, addressing these challenges has relied on labor-intensive dictionary based or string matching approaches, leaving the problem of patents' assignee harmonization on large datasets mostly unresolved. To bridge this gap, this paper describes the Terrorizer algorithm, a text-based algorithm that leverages natural language processing (NLP), network theory, and rule-based techniques to harmonize the variants of company names recorded as patent assignees. In particular, the algorithm follows the tripartite structure of its antecedents, namely parsing, matching and filtering stage, adding an original "knowledge augmentation" phase which is used to enrich the information available on each assignee name. We use Terrorizer on a set of 325'917 companies' names who are assignees of patents granted by the USPTO from 2005 to 2022. The performance of Terrorizer is evaluated on four gold standard datasets. This validation step shows us two main things: the first is that the performance of Terrorizer is similar over different kind of datasets, proving that our algorithm generalizes well. Second, when comparing its performance with the one of the algorithm currently used in PatentsView for the same task (Monath et al., 2021), it achieves a higher F1 score. Finally, we use the Tree-structured Parzen Estimator (TPE) optimization algorithm for the hyperparameters' tuning. Our final result is a reduction in the initial set of names of over 42%.


Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination

Kang, Haoqiang, Liu, Xiao-Yang

arXiv.org Artificial Intelligence

The hallucination issue is recognized as a fundamental deficiency of large language models (LLMs), especially when applied to fields such as finance, education, and law. Despite the growing concerns, there has been a lack of empirical investigation. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we empirically investigate LLM model's ability of explaining financial concepts and terminologies. Second, we assess LLM models' capacity of querying historical stock prices. Third, to alleviate the hallucination issue, we evaluate the efficacy of four practical methods, including few-shot learning, Decoding by Contrasting Layers (DoLa), the Retrieval Augmentation Generation (RAG) method and the prompt-based tool learning method for a function to generate a query command. Finally, our major finding is that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks. Therefore, there is an urgent need to call for research efforts in mitigating LLMs' hallucination.


Assessing Look-Ahead Bias in Stock Return Predictions Generated By GPT Sentiment Analysis

Glasserman, Paul, Lin, Caden

arXiv.org Artificial Intelligence

Large language models (LLMs), including ChatGPT, can extract profitable trading signals from the sentiment in news text. However, backtesting such strategies poses a challenge because LLMs are trained on many years of data, and backtesting produces biased results if the training and backtesting periods overlap. This bias can take two forms: a look-ahead bias, in which the LLM may have specific knowledge of the stock returns that followed a news article, and a distraction effect, in which general knowledge of the companies named interferes with the measurement of a text's sentiment. We investigate these sources of bias through trading strategies driven by the sentiment of financial news headlines. We compare trading performance based on the original headlines with de-biased strategies in which we remove the relevant company's identifiers from the text. In-sample (within the LLM training window), we find, surprisingly, that the anonymized headlines outperform, indicating that the distraction effect has a greater impact than look-ahead bias. This tendency is particularly strong for larger companies--companies about which we expect an LLM to have greater general knowledge. Out-of-sample, look-ahead bias is not a concern but distraction remains possible. Our proposed anonymization procedure is therefore potentially useful in out-of-sample implementation, as well as for de-biased backtesting.


'Painted into a corner': can generative AI save Meta from the metaverse?

The Guardian

Meta is not pivoting away from its signature product, the metaverse. Or at least that's what the Meta chief executive, Mark Zuckerberg, is arguing. Despite reports that sales teams at Meta have spent less time pitching the metaverse to advertisers, Zuckerberg claimed on the tech firm's latest quarterly earnings call that it's business as usual over at the company formerly known as Facebook. "A narrative has developed that we're somehow moving away from focusing on the metaverse vision, so I just want to say upfront that that's not accurate," the CEO said. But neither is the virtual reality world the only product Meta has bet its future on, Zuckerberg argued: "We've been focusing on both AI and the metaverse for years now, and we will continue to focus on both."